Data Ingestion Patterns

In my last blog I highlighted some details around data ingestion, including topology and latency examples. This blog should finish up the topic for now; I will return to it later, with more focus on the architectures that a number of open-source projects are enabling. In this blog I want to talk about two common ingestion patterns.

Point to Point

[Figure: point to point ingestion]

I want to discuss the most used pattern (or is that an anti-pattern?): point to point integration, where enterprises take the simplest approach to implementing ingestion and connect each data source directly to its target.

Point to point ingestion employs a direct connection between a data source and a data target.

Point to point ingestion tends to offer long term pain with short term savings.

That is not to say that point to point ingestion should never be used (e.g. as a short-term solution or for extremely high performance requirements), but it must be approved and justified as part of an overall architecture governance activity so that other possibilities may be considered. Otherwise point to point ingestion will become the norm.

Point to point data ingestion is often fast and efficient to implement, but it leaves the connections between the source and target data stores tightly coupled. In the short term this is not an issue, but over the long term, as more and more data stores are ingested, the environment becomes overly complex and inflexible. For instance, if an organization is migrating to a replacement system, every ingestion connection touching the system being replaced will have to be re-written.
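To make that coupling concrete, here is a minimal sketch of a single point to point job, assuming a hypothetical orders table in one SQLite database being copied into a hypothetical reporting table in another; all table and column names are illustrative. Every source/target pair needs its own bespoke job like this one, so the number of jobs (and the rework when a system is replaced) grows with every new connection.

```python
# A minimal, illustrative point to point ingestion job (hypothetical schemas).
import sqlite3

def ingest_orders_to_reporting(source_db: str, target_db: str) -> None:
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    # The source query and target schema are hard-coded here, tightly
    # coupling this job to both systems; replacing either system means
    # rewriting the job.
    tgt.execute("CREATE TABLE IF NOT EXISTS report_orders (id INTEGER, amount REAL)")
    rows = src.execute("SELECT order_id, total_amount FROM orders").fetchall()
    tgt.executemany("INSERT INTO report_orders VALUES (?, ?)", rows)
    tgt.commit()
    src.close()
    tgt.close()
```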

Overall, point to point ingestion tends to lead to higher maintenance costs and slower data ingestion implementations.

Hub and Spoke

A common approach to address the challenges of point to point ingestion is hub and spoke ingestion.

[Figure: hub and spoke ingestion]

The hub and spoke ingestion approach decouples the source and target systems. The ingestion connections made in a hub and spoke approach are simpler than in a point to point approach as the ingestions are only to and from the hub. The hub manages the connections and performs the data transformations.
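As an illustration of that decoupling, here is a minimal sketch of a hub that owns the source-specific transformations and the target deliveries; the Hub class and its methods are hypothetical, not a real API.

```python
# A minimal sketch of the hub and spoke idea: sources and targets only
# know about the hub, never about each other.
from typing import Callable, Dict, List

Record = Dict[str, object]

class Hub:
    def __init__(self) -> None:
        # One transformation per source, one delivery callable per target.
        self._transforms: Dict[str, Callable[[Record], Record]] = {}
        self._targets: List[Callable[[Record], None]] = []

    def register_source(self, name: str, transform: Callable[[Record], Record]) -> None:
        self._transforms[name] = transform

    def register_target(self, deliver: Callable[[Record], None]) -> None:
        self._targets.append(deliver)

    def publish(self, source: str, record: Record) -> None:
        # The hub owns the connections and the transformation, then fans out.
        canonical = self._transforms[source](record)
        for deliver in self._targets:
            deliver(canonical)

# Usage: one source, two targets, no source-to-target coupling.
hub = Hub()
hub.register_source("crm", lambda r: {"customer_id": r["id"], "name": r["full_name"]})
hub.register_target(lambda r: print("warehouse <-", r))
hub.register_target(lambda r: print("search index <-", r))
hub.publish("crm", {"id": 42, "full_name": "Ada Lovelace"})
```

Adding or replacing a target only touches the hub registration; no source needs to change.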

The hub and spoke ingestion approach costs more in the short term, as it incurs some up-front costs (e.g. deployment of the hub). But, by minimizing the number of data ingestion connections required, it simplifies the environment and achieves a greater level of flexibility to support changing requirements, such as the addition or replacement of data stores.

Another advantage of this approach is that it enables a level of information governance and standardization over the data ingestion environment, which is impractical in a point to point ingestion environment.

It must be remembered that the hub in question here is a logical hub; in very large organizations a single physical hub may lead to performance and latency challenges. Therefore a distributed and/or federated approach should be considered. To assist with scalability, distributed hubs address different ingestion mechanisms (e.g. an ETL hub, an event processing hub), while employing a federation of hub and spoke architectures enables better routing and load balancing capabilities. Invariably, large organizations’ data ingestion architectures will veer towards a hybrid approach, where a distributed/federated hub and spoke architecture is complemented by a minimal set of approved and justified point to point connections.

Here is a high-level view of a hub and spoke ingestion architecture.

[Figure: hub and spoke ingestion architecture]
Collection Area

The collection area focuses on connecting to the various data sources to acquire and filter the required data. This is the first destination for acquired data that provides a level of isolation between the source and target systems.

This capture process connects to and acquires data from the various sources using any or all of the available ingestion engines. Data can be captured through a variety of synchronous and asynchronous mechanisms. The mechanisms used will vary depending on the data source capability, capacity, regulatory compliance, and access requirements. The rate and frequency at which data are acquired, and the rate and frequency at which data are refreshed in the hub, are driven by business needs.

If multiple targets require data from a data source, then the cumulative data requirements are acquired from the data source at the same time. This minimizes the number of capture processes that need to be executed for a data source and therefore minimizes the impact on the source systems. Looking at the ingestion project pipeline, it is prudent to consider capturing all potentially relevant data. If a target requires aggregated data from multiple data sources, and the rate and frequency at which data can be captured is different for each source, then a landing zone can be utilized. The landing zone enables data to be acquired at various rates (e.g. in small frequent increments or large bulk transfers), asynchronously to the rate at which data are refreshed for consumption. The data captured in the landing zone will typically be stored in the same format as in the source system. The stores in the landing zone can be prefixed with the name of the source system, which assists in keeping data logically segregated and supports data lineage requirements.
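As a concrete (and purely illustrative) sketch of that naming convention, the snippet below lands each captured batch under a prefix named after the source system; the landing directory, source and dataset names are assumptions.

```python
# A minimal sketch of a landing zone layout on a local filesystem.
# Each batch is stored as-is, in the source's own format, under a
# source-system prefix to keep data segregated and support lineage.
from datetime import datetime, timezone
from pathlib import Path

def land(landing_root: str, source_system: str, dataset: str, payload: bytes) -> Path:
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # e.g. landing/crm/orders/20240101T120000Z.raw
    path = Path(landing_root) / source_system / dataset / f"{batch_id}.raw"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path

print(land("landing", "crm", "orders", b'{"order_id": 1, "total_amount": 9.99}'))
```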

If required, data quality capabilities can be applied against the acquired data. Performing this activity in the collection area minimizes the need to cleanse the same data multiple times for different targets.
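Here is a minimal sketch of such once-only cleansing; the rules shown (require a key field, trim whitespace, drop duplicates) are illustrative only.

```python
# Apply basic data quality rules once, in the collection area, so each
# target does not have to cleanse the same data again.
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]

def cleanse(records: Iterable[Record], key: str) -> Iterator[Record]:
    seen = set()
    for rec in records:
        if not rec.get(key):          # reject records missing the key field
            continue
        cleaned = {k: v.strip() for k, v in rec.items()}
        if cleaned[key] in seen:      # drop duplicates on the key field
            continue
        seen.add(cleaned[key])
        yield cleaned

raw: List[Record] = [
    {"customer_id": " 1 ", "name": "Ada "},
    {"customer_id": "1", "name": "Ada"},    # duplicate
    {"customer_id": "", "name": "Nobody"},  # missing key
]
print(list(cleanse(raw, "customer_id")))
```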

Processing Area

The processing area enables the transformation and mediation of data to support target system data format requirements. This requires the processing area to support capabilities such as transformation of structure, encoding and terminology, aggregation, splitting, and enrichment. In addition, the processing area minimizes the impact of change (e.g. change of target and/or source systems data requirements) on the ingestion process.

If both the source and target systems use the same format for the data, and no transformation is required, then it is possible to bypass the processing area. This is quite common when ingesting unstructured or semi-structured data (e.g. log files), where downstream data processing will address transformation requirements. As previously stated, the intent of a hub and spoke approach is to decouple the source systems from the target systems. This means decoupling not only the connectivity, acquisition, and distribution of data, but also the transformation process. Without decoupling data transformation, organizations will end up with point to point transformations, which will eventually lead to maintenance challenges.

To circumvent point to point data transformations, the source data can be mapped into a standardized format, the required data transformations applied there, and the transformed data then mapped onto the target data structure. This approach adds performance overhead, but it has the benefit of controlling costs and enabling agility: only one mapping per source and one per target needs to be maintained, and transformation rules can be reused.
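Here is a minimal sketch of that idea, assuming hypothetical "crm" and "erp" sources and a "warehouse" target; the field names and the canonical customer shape are illustrative. Each source maps into the canonical shape once, and each target maps out of it once, instead of maintaining a transformation per source/target pair.

```python
# Transform via a standardized (canonical) shape rather than per pair.
from typing import Callable, Dict

Record = Dict[str, object]

# Source-specific mappings into the canonical customer model.
to_canonical: Dict[str, Callable[[Record], Record]] = {
    "crm": lambda r: {"customer_id": r["id"], "full_name": r["name"]},
    "erp": lambda r: {"customer_id": r["cust_no"], "full_name": r["cust_name"]},
}

# Target-specific mappings out of the canonical model.
to_target: Dict[str, Callable[[Record], Record]] = {
    "warehouse": lambda c: {"dim_customer_key": c["customer_id"], "name": c["full_name"]},
}

def transform(source: str, target: str, record: Record) -> Record:
    canonical = to_canonical[source](record)   # source -> canonical
    return to_target[target](canonical)        # canonical -> target

print(transform("erp", "warehouse", {"cust_no": 7, "cust_name": "Grace Hopper"}))
```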

This standardized format is sometimes known as a canonical data model. It is independent of any structures utilized by any of the source and target systems. It is advantageous to have the canonical data model based on an enterprise data model, although this is not always possible. The enterprise data model typically only covers business-relevant entities and invariably will not cover all entities that are found in all source and target systems. Furthermore, an enterprise data model might not exist at all. To address these challenges, canonical data models can be based on industry models (when available). This base model can then be customized to the organization's needs.

While it is advantageous to have a single canonical data model, this is not always possible (e.g. cost, size of an organization, diversification of business units). In such cases a pragmatic alternative is to federate the canonical data models. For example, each functional domain within a large enterprise could create a domain-level canonical data model. Transformations between the domains could then be defined.

Distribution Area

The distribution area focuses on connecting to the various data targets to deliver the appropriate data. The deliver process connects to the various data targets and distributes data through a variety of synchronous and asynchronous mechanisms. The mechanisms utilized, and the rate and frequency at which data are delivered, will vary depending on the data target capability, capacity, and access requirements.

Initially the deliver process acquires data from the other areas (i.e. collection, processing). This data can optionally be placed in a holding zone before distribution (in case a “store and forward” approach needs to be utilized). The deliver process identifies the target stores based on distribution rules and/or content-based routing. This can be as simple as distributing the data to a single target store, or routing specific records to various target stores.
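Here is a minimal sketch of content-based routing, with illustrative rule predicates and target names; the real delivery call is stood in for by a print.

```python
# Deliver each record to every target whose distribution rule matches its content.
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

routing_rules: List[Tuple[str, Callable[[Record], bool]]] = [
    ("warehouse", lambda r: True),                           # everything
    ("fraud_queue", lambda r: float(r["amount"]) > 10000),   # high-value only
    ("eu_archive", lambda r: r.get("region") == "EU"),       # regional archive
]

def distribute(record: Record) -> List[str]:
    targets = [name for name, rule in routing_rules if rule(record)]
    for name in targets:
        print(f"deliver to {name}: {record}")  # stand-in for the real delivery call
    return targets

distribute({"order_id": 1, "amount": 25000.0, "region": "EU"})
```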

That is more than enough for today. As I said earlier, next time I think I will focus more on data ingestion architectures built with the aid of open-source projects. See you then.

Good luck, Now Go Architect…